
    Distributed Inference and Query Processing for RFID Tracking and Monitoring

    In this paper, we present the design of a scalable, distributed stream processing system for RFID tracking and monitoring. Since RFID data lacks the containment and location information that is key to query processing, we propose to combine location and containment inference with stream query processing in a single architecture, with inference as an enabling mechanism for high-level query processing. We further consider the challenges of instantiating such a system in large distributed settings and design techniques for distributed inference and query processing. Our experimental results, using both real-world data and large synthetic traces, demonstrate the accuracy, efficiency, and scalability of our proposed techniques.
    Comment: VLDB201
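    The coupling of location inference and high-level querying can be illustrated with a minimal sketch. All names here are illustrative, and a majority vote over recent reads stands in for the paper's probabilistic inference:

```python
from collections import Counter, defaultdict

def infer_locations(reads, window=5):
    """Infer each object's most likely location from its last `window`
    noisy RFID reads (simple majority vote; an illustrative stand-in
    for the probabilistic inference in the paper)."""
    history = defaultdict(list)
    for tag, antenna in reads:
        history[tag].append(antenna)
    return {tag: Counter(seen[-window:]).most_common(1)[0][0]
            for tag, seen in history.items()}

def objects_at(reads, location):
    """High-level query answered on top of the inferred locations."""
    locs = infer_locations(reads)
    return sorted(tag for tag, loc in locs.items() if loc == location)

# Noisy read stream: tag1's first read at "dock" is a spurious detection.
reads = [("tag1", "dock"), ("tag1", "shelfA"), ("tag1", "shelfA"),
         ("tag2", "dock"), ("tag2", "dock")]
print(objects_at(reads, "shelfA"))  # ['tag1']
```

    The point of the architecture is exactly this layering: raw reads never reach the query operator directly; inference first turns them into location facts the query can consume.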

    A Factorized Version Space Algorithm for "Human-In-the-Loop" Data Exploration

    While active learning (AL) has recently been applied to help users explore a large database and retrieve data instances of interest, existing methods often require a large number of instances to be labeled in order to achieve good accuracy. To address this slow convergence problem, our work augments version space-based AL algorithms, which have strong theoretical results on convergence but are very costly to run, with additional insights obtained from the user labeling process. These insights lead to a novel algorithm that factorizes the version space to perform active learning in a set of subspaces. Our work offers theoretical results on optimality and approximation for this algorithm, as well as optimizations for better performance. Evaluation results show that our factorized version space algorithm significantly outperforms other version space algorithms, as well as a recent factorization-aware algorithm, for large database exploration.
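    A toy sketch conveys the core mechanics of version-space AL on one subspace. The hypotheses, data, and greedy query rule below are illustrative, not the paper's algorithm; the factorization idea is that for a conjunctive target over several attributes, one can maintain a small version space like this per attribute instead of one huge joint space:

```python
# Toy version space of 1-D threshold classifiers h_t(x) = [x >= t].
def consistent(thresholds, labeled):
    """Hypotheses still consistent with all (value, label) pairs."""
    return [t for t in thresholds
            if all((v >= t) == y for v, y in labeled)]

def pick_query(points, version_space):
    """Greedy query selection: pick the point whose unknown label
    splits the remaining version space most evenly (bisection)."""
    def disagreement(v):
        pos = sum(1 for t in version_space if v >= t)
        return min(pos, len(version_space) - pos)
    return max(points, key=disagreement)

thresholds = list(range(0, 11))       # candidate hypotheses t = 0..10
labeled = [(9, True), (1, False)]     # user feedback so far
vs = consistent(thresholds, labeled)  # survivors: t = 2..9
q = pick_query([2, 5, 8], vs)         # 5 best bisects the space
```

    Each answered query roughly halves the surviving hypotheses, which is the source of the strong convergence guarantees the abstract mentions.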

    Capturing Data Uncertainty in High-Volume Stream Processing

    We present the design and development of a data stream system that captures data uncertainty from data collection to query processing to final result generation. Our system focuses on data that is naturally modeled as continuous random variables. For such data, our system employs an approach grounded in probability and statistical theory to capture data uncertainty and integrates this approach into high-volume stream processing. The first component of our system captures uncertainty of raw data streams from sensing devices. Since such raw streams can be highly noisy and may not carry sufficient information for query processing, our system employs probabilistic models of the data generation process and stream-speed inference to transform raw data into a desired format with an uncertainty metric. The second component captures uncertainty as data propagates through query operators. To efficiently quantify result uncertainty of a query operator, we explore a variety of techniques based on probability and statistical theory to compute the result distribution at stream speed. We are currently working with a group of scientists to evaluate our system using traces collected from the domains of (and eventually in the real systems for) hazardous weather monitoring and object tracking and monitoring.
    Comment: CIDR 200
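    Propagating uncertainty through an operator is cheapest when the result distribution has a closed form. A minimal sketch, assuming independent Gaussian-distributed inputs (the system in the paper handles more operators and distributions than this):

```python
import math
from dataclasses import dataclass

@dataclass
class Gaussian:
    """A tuple attribute modeled as a continuous random variable."""
    mean: float
    var: float

def sum_aggregate(values):
    """Propagate uncertainty through a SUM operator: the sum of
    independent Gaussians is exactly Gaussian with summed moments,
    a closed form that can be evaluated at stream speed."""
    return Gaussian(sum(g.mean for g in values),
                    sum(g.var for g in values))

readings = [Gaussian(10.0, 0.5), Gaussian(12.0, 0.3), Gaussian(9.5, 0.2)]
total = sum_aggregate(readings)   # mean 31.5, variance 1.0
stddev = math.sqrt(total.var)
```

    Operators without such closed forms are where the harder approximation techniques the abstract alludes to come in.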

    Anomaly Detection and Explanation Discovery on Event Streams

    As enterprise information systems collect event streams from various sources, the ability of a system to automatically detect anomalous events and provide human-readable explanations is of paramount importance. In this position paper, we argue for the need for a new type of data stream analytics that can address anomaly detection and explanation discovery in a single, integrated system, which not only offers increased business intelligence but also opens up opportunities for improved solutions. In particular, we propose a two-pass approach to building such a system, highlight the challenges, and offer initial directions for solutions.
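    The two-pass structure can be sketched in miniature: a first pass flags anomalies, a second pass contrasts them with normal events to surface an explanation. The detector (z-score) and explainer (frequency contrast) below are deliberately simple stand-ins, and all field names are illustrative:

```python
from collections import Counter
from statistics import mean, stdev

def detect(events, key="latency_ms", z=1.5):
    """Pass 1: flag events whose metric deviates more than z standard
    deviations from the stream mean."""
    vals = [e[key] for e in events]
    mu, sd = mean(vals), stdev(vals)
    return [e for e in events if abs(e[key] - mu) > z * sd]

def explain(anomalies, normal, attr="host"):
    """Pass 2: report attribute values over-represented among the
    anomalies relative to normal events -- a human-readable hint at
    what the anomalies have in common."""
    bad = Counter(e[attr] for e in anomalies)
    ok = Counter(e[attr] for e in normal)
    return sorted(v for v, c in bad.items() if c > ok.get(v, 0))

events = ([{"host": h, "latency_ms": 100} for h in "abababab"]
          + [{"host": "c", "latency_ms": 500}] * 2)
anoms = detect(events)
rest = [e for e in events if e not in anoms]
print(explain(anoms, rest))  # ['c']
```

    Running both passes inside one system is what lets the explainer reuse the detector's context instead of re-scanning the stream.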

    Spark-based Cloud Data Analytics using Multi-Objective Optimization

    Data analytics in the cloud has become an integral part of enterprise businesses. Big data analytics systems, however, still lack the ability to take user performance goals and budgetary constraints for a task, collectively referred to as task objectives, and automatically configure an analytic job to achieve these objectives. This paper presents a data analytics optimizer that can automatically determine a cluster configuration with a suitable number of cores, as well as other system parameters, that best meets the task objectives. At the core of our work is a principled multi-objective optimization (MOO) approach that computes a Pareto-optimal set of job configurations to reveal tradeoffs between different user objectives, recommends a new job configuration that best explores such tradeoffs, and employs novel optimizations to enable such recommendations within a few seconds. We present efficient incremental algorithms based on the notion of a Progressive Frontier for realizing our MOO approach and implement them in a Spark-based prototype. Detailed experiments using benchmark workloads show that our MOO techniques provide a 2-50x speedup over existing MOO methods while offering good coverage of the Pareto frontier. Compared to OtterTune, a state-of-the-art performance tuning system, our approach recommends configurations that yield a 26%-49% reduction in the running time of the TPCx-BB benchmark while adapting to different application preferences on multiple objectives.
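    The Pareto-optimal set at the heart of the approach is easy to state concretely. A brute-force sketch with made-up candidate configurations (the paper's Progressive Frontier algorithms compute this incrementally rather than by pairwise comparison):

```python
def dominates(a, b):
    """Config a dominates b if it is no worse on every objective
    (lower is better here) and strictly better on at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(configs):
    """Keep configurations not dominated by any other -- the tradeoff
    set the optimizer exposes to the user."""
    return [c for c in configs
            if not any(dominates(o, c) for o in configs if o != c)]

# Hypothetical (latency_s, cost_$) per candidate cluster configuration.
candidates = [(120, 4), (80, 7), (80, 9), (200, 2), (90, 8)]
print(pareto_front(candidates))  # [(120, 4), (80, 7), (200, 2)]
```

    The dominated points (80, 9) and (90, 8) are dropped because (80, 7) is at least as good on both objectives; what survives is the frontier from which a recommendation is picked according to the user's preferences.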

    Fine-Grained Modeling and Optimization for Intelligent Resource Management in Big Data Processing

    Big data processing at the production scale presents a highly complex environment for resource optimization (RO), a problem crucial for meeting the performance goals and budgetary constraints of analytical users. The RO problem is challenging because it involves a set of decisions (the partition count, the placement of parallel instances on machines, and the resource allocation to each instance), requires multi-objective optimization (MOO), and is compounded by the scale and complexity of big data systems while having to meet stringent time constraints for scheduling. This paper presents a MaxCompute-based integrated system to support multi-objective resource optimization via fine-grained instance-level modeling and optimization. We propose a new architecture that breaks RO into a series of simpler problems, new fine-grained predictive models, and novel optimization methods that exploit these models to make effective instance-level RO decisions well under a second. Evaluation using production workloads shows that our new RO system reduces latency by 37-72% and cost by 43-78% simultaneously, compared to the current optimizer and scheduler, while running in 0.02-0.23s.
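    The interplay of predictive models and instance-level decisions can be sketched minimally. The `predict` model below is entirely hypothetical (a stand-in for the learned fine-grained predictors), and the single weighted score is a simplification of the paper's joint multi-objective optimization:

```python
def predict(cores, mem_gb, work_units=100.0):
    """Hypothetical instance-level model: latency falls as resources
    grow while cost rises with them. The real system learns such
    models from production traces."""
    latency = work_units / (cores * 0.8 + mem_gb * 0.1)
    cost = cores * 0.05 + mem_gb * 0.01
    return latency, cost

def best_allocation(weight_latency=0.5):
    """Scan a small (cores, memory_gb) grid and return the allocation
    minimizing a weighted objective -- instance-level resource
    optimization in miniature."""
    grid = [(c, m) for c in (1, 2, 4, 8) for m in (4, 8, 16)]
    def score(cm):
        lat, cost = predict(*cm)
        return weight_latency * lat + (1 - weight_latency) * cost
    return min(grid, key=score)

print(best_allocation())  # (8, 16)
```

    Breaking RO into per-instance decisions like this is what keeps each subproblem small enough to solve within the sub-second scheduling budget the abstract cites.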

    Query Processing for Large-Scale XML Message Brokering


    Learning based Web query processing

    Unlike relational database systems, which return exact data items for a query, most current Web search and query systems return URLs of pages that might contain the required information. Processing those URLs to extract the required information falls to the user, which is both ineffective and costly. In this work, we propose a novel Web query processing approach with learning capabilities. Under this approach, user queries are posted as keywords that may not precisely describe the query requirements. A general-purpose search engine such as Yahoo! is employed to obtain candidate query results in the form of URLs. The first few highly ranked URLs are returned to the user for browsing. Meanwhile, the query processor learns how the user navigates through hyperlinks and locates the query result within a Web page. With the learned knowledge, the query processor processes the remaining URLs to produce precise query results without user involvement. Preliminary experimental results indicate that the approach can process a range of Web queries with satisfactory performance. The architecture of such a query processor, techniques for modeling HTML pages, and knowledge for navigation and queried-segment identification are discussed. Experiments on the effectiveness of the approach, the required knowledge, and the training strategies are presented.
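    One ingredient of learning navigation knowledge can be sketched very simply: generalize from the link texts the user clicked, then reuse the generalization on unseen pages. The keyword-intersection rule and all data below are illustrative stand-ins for the paper's learned models:

```python
def learn_pattern(clicked_anchors):
    """Learn the word common to every link text the user followed --
    a toy stand-in for learned navigation knowledge."""
    words = [set(a.lower().split()) for a in clicked_anchors]
    common = set.intersection(*words)
    return common.pop() if common else None

def navigate(page_links, keyword):
    """Apply the learned keyword to pick the link to follow on an
    unseen page, without user involvement."""
    for text, url in page_links:
        if keyword and keyword in text.lower().split():
            return url
    return None

kw = learn_pattern(["Faculty Directory", "Staff Directory"])
target = navigate([("Home", "/"), ("People Directory", "/people")], kw)
# kw == 'directory', target == '/people'
```

    The same idea, applied to in-page structure rather than link text, underlies the queried-segment identification the abstract mentions.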